Abstract

Evaluators often utilize ANCOVA-type techniques to assess the effects
of innovative programs implemented in naturalistic settings. In this
paper, design, analysis, and reporting considerations important to the
application of ANCOVA-type techniques in educational settings are
described. Numerous examples are drawn from the national Follow
Through evaluation, and suggestions for improving reports utilizing
such ANCOVA-type techniques are presented. The overall perspective
is that evaluation reports must be more precise and must indicate the
limitations as well as the strengths of the methodology used for this
specific setting. In doing so, a more balanced description of a program
and its effects is presented to the decision maker and to other
stakeholders.
REPORTING RESULTS IN EVALUATION SETTINGS: EMPHASIZING
SELECTED ISSUES IN ANCOVA ANALYSIS AND INTERPRETATION

James L. DiCostanzo and R. Tony Eichelberger

Learning Research and Development Center 
University of Pittsburgh 

Since the early 1960's, the federal government has authorized and
funded numerous social action programs, many of which focused on
compensatory education. The evaluations of these programs have
usually been attempts to implement an experimental paradigm designed
to maximize internal validity. Since manipulation of important
variables is rarely possible (and often not appropriate) in evaluation
settings (Cooley, 1978), some type of analysis of covariance (ANCOVA)
technique is frequently utilized to compensate statistically for the
lack of experimental control.

Use and interpretation of the ANCOVA technique is extremely complex,
requiring that numerous assumptions and conditions be met if
meaningful interpretations are to be applied to educational settings.
These assumptions are never precisely met in an evaluation setting,
so the extent of the deviations and their impact on meaningful
interpretations must be assessed and presented in the evaluation.

The types of problems that arise from the use of complex data
analysis techniques, such as ANCOVA, that are addressed in this paper
were identified from a review and critique of the evaluation of the
national Follow Through program. There has been no attempt to
comprehensively identify the evaluation problems or the issues that
relate to utilizing and reporting ANCOVA results. The problems and
issues discussed are of recurrent concern to educational evaluators in
various settings.



In this paper, specific information that should be included in an
evaluation report when ANCOVA-type techniques are used is identified.
This information should enable the reader to accurately assess the
adequacy of the technique and the appropriateness of the evaluator's
interpretation of the results for that particular setting. Specific
examples of the kinds of problems that arise when collecting data in
school settings are described to illustrate the need for this
additional information. Alternative ways of presenting the needed
information in an evaluation report are presented and discussed.

The comments and suggestions made in this paper follow primarily
from the longitudinal evaluation of the national Follow Through program.
This evaluation is a typical example of the application of an
ANCOVA-type analysis technique in an evaluation setting. Because of
its scope and duration (six years), the Follow Through evaluation
encountered most of the problems that evaluators must face and that
result from the use of this technique.

National Follow Through Program

A brief historical sketch of the national Follow Through program
and its changing purposes is needed to understand and appreciate the
methodological issues discussed in the remainder of this paper. In
1966, there were indications that Head Start, a federally funded
compensatory education program for disadvantaged preschool children,
was having some positive effects, but that the effects did not endure
through the early elementary school years (Wolff & Stein, 1966). The
Follow Through (FT) program was planned as a massive service program
and was designed to extend compensatory education (similar to
that afforded the Head Start children) from kindergarten through grade
three (Johnson, 1967). When FT was originally funded, only $15 million
was appropriated for two years, rather than the $120 million that was
expected.
To the Office of Education, Follow Through then became a planned
variation experiment in which diverse types of innovative programs
were implemented in various sites throughout the U. S. But, rather
than assigning programs randomly to sites or projects as in a controlled
experiment, participating local districts, in cooperation with the
programs' sponsors, were allowed to select the instructional model to be
implemented in their project. Although this procedure later caused
some methodological problems, it is probably more representative of
the operation of U. S. public schools than is the random assignment of
programs to sites.

In the initial two years of the FT program (1967-68 and 1968-69),
the evaluation focus was somewhat confused, due primarily to the
change in the program emphasis from service to a planned variation
experiment and the associated administrative problems. In 1968-69,
several purposes for the national Follow Through evaluation were
delineated, including: (a) assessing program impact on pupils, parents,
schools, and community (Emrick, Sorensen, & Stearns, 1973, p. 72);
(b) assessing relative effectiveness of different programs and program
approaches (Sorensen & Madow, 1969, p. 4); and (c) establishing
criteria for effectiveness and success of the national FT program
(Sorensen & Madow, 1969, p. 4).

In this paper, we are concerned with selected aspects of these
three purposes, which deal with the impact of the FT programs. The
evaluation attempted to accomplish these purposes by using ANCOVA.

Approximately 70 of the 170 local projects representing 14 of 22
FT sponsor models were included in the national FT evaluation. In
each FT school district, students identified as similar to those
participating in FT comprised the Non-Follow Through (NFT) sample and
were tested on a regular basis by Stanford Research Institute (SRI),
the organization contracted to collect all FT evaluation data. When
comparable students could not be identified locally, a comparison or
control group from a neighboring school district was identified and
tested. Noncomparability of the FT and NFT groups at a particular
site was often a result of the school district's policy of assigning the
most disadvantaged children to the FT program. Noncomparability,
for this and other reasons, was an ongoing problem in the evaluation
that the use of ANCOVA attempted to alleviate, despite the lack of
randomization in the design.

Decision makers associated with the early years of FT were
confident that the program would have a marked impact on the
participating children. Richard Egbert (1973), the original FT
Director, indicated that the evaluation design was based on the
conviction that:

children's development would be so markedly superior as
to be readily demonstrated on measures of achievement,
cognition, self-concept, social maturation, and capacity
to function independently. Follow Through's design was
born also from the conviction that unless such substantial
differences were manifest, the really massive increases
in spending that would be required could not be justified.
(p. 25)

These convictions seem to have resulted in less concern with details 
of the design, since it was believed that any reasonable evaluation of 
FT would readily show the impact and effectiveness of the program. 

The FT evaluation has vacillated in emphasis from a decision
orientation of identifying the "best" model(s) overall to a descriptive
orientation in which different effects of individual models would be
described. Initially, SRI was awarded an evaluation contract to
identify the most effective program model(s) and to provide descriptive
information to project administrators and other school administrators.
At various times, it was decided that a consumer's guide, which would
list individual sponsors' objectives and the degree to which the
objectives were met, was to be produced by SRI. Since 1972, the major
objective of the national FT evaluation has been to identify the
successful model(s) and to document the impact of the models on pupils.
An ANCOVA-type procedure has been utilized for this purpose.

SRI and Abt Associates, the major contractors for the longitudinal
evaluation of the impact of FT, have produced four reports. The SRI
report covered the interim years of FT, 1969-71. Abt Associates
have produced four reports covering the years 1972 through 1975. The
SRI report (Emrick et al., 1973) and the third Abt report (Stebbins,
1976) are used for illustrative purposes in this paper. For an
extensive review and critique of the FT evaluation, see Haney (1977).

Analysis of Covariance (ANCOVA)

As indicated above, ANCOVA is often used in evaluation settings
where it is difficult or impossible to control experimentally alternative
explanations of educational outcomes. In situations where its use is
appropriate, it allows groups to be compared on a criterion variable
that has been adjusted on a set of concomitant variables, or covariates.
Statistically, ANCOVA is used to increase the precision of the analysis
by taking advantage of the linear relationships between the dependent
variable(s) and the covariate(s). In order for ANCOVA to be
unambiguously used, however, its assumptions and conditions must be
precisely met. Failure to do so may distort the results in ways that make
their interpretation equivocal, if not meaningless.
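
The adjustment described above can be sketched for the simplest case:
two groups and a single covariate. The sketch below is illustrative
only. It assumes a common (pooled within-group) slope, and the scores,
group labels, and function names are hypothetical, not taken from the
FT data or from any report cited here.

```python
from statistics import mean


def pooled_within_slope(groups):
    """Pooled within-group slope of the outcome on the covariate.

    groups maps a group label to a list of (covariate, outcome) pairs.
    """
    sxy = 0.0
    sxx = 0.0
    for pairs in groups.values():
        x_bar = mean(x for x, _ in pairs)
        y_bar = mean(y for _, y in pairs)
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in pairs)
        sxx += sum((x - x_bar) ** 2 for x, _ in pairs)
    return sxy / sxx


def adjusted_means(groups):
    """Outcome means adjusted to the grand mean of the covariate."""
    slope = pooled_within_slope(groups)
    grand_x = mean(x for pairs in groups.values() for x, _ in pairs)
    adjusted = {}
    for label, pairs in groups.items():
        x_bar = mean(x for x, _ in pairs)
        y_bar = mean(y for _, y in pairs)
        adjusted[label] = y_bar - slope * (x_bar - grand_x)
    return adjusted


# Hypothetical (pretest, posttest) scores; the FT group starts lower.
scores = {
    "FT": [(1, 3), (2, 5), (3, 7)],
    "NFT": [(3, 6), (4, 8), (5, 10)],
}
raw = {label: mean(y for _, y in pairs) for label, pairs in scores.items()}
adj = adjusted_means(scores)
print("raw means:", raw)
print("adjusted means:", adj)
```

With these made-up scores the unadjusted posttest means favor the NFT
group, while the adjusted means favor the FT group. The reversal
depends entirely on the pooled slope, which is why the assumptions
behind that slope matter for interpretation.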



Abt Associates' evaluation report (Stebbins, 1976) discussed
several problems associated with the analysis of data collected in the
FT evaluation setting. We have selectively drawn examples from that
report to illustrate our points. As a result, our paper tends to
emphasize only the most questionable analysis and reporting procedures
in the Abt report. Abt had the very difficult task of attempting to draw
conclusions from a complex non-experimental setting. See Appendix
A for a brief statement of Abt's view of their role and situation (Stebbins,
1976, p. A-46).
We believe that the consumer of the evaluation report must be able
to: (a) assess the appropriateness of ANCOVA whenever it is used,
and (b) examine possible alternative interpretations of the results.
For these purposes, information regarding the conformity or
nonconformity to the assumptions and conditions of ANCOVA, and other
information that would enable alternative interpretations of the results
to be made, must be available in the report.

Areas of Concern 

We have delineated some information we believe is necessary for
the reader to achieve the two purposes stated above, and we have
organized it into five topical areas. Each area is focused by one or more
questions that the evaluator should address.

How are the Specific Research Hypotheses Investigated
and the Results of the Corresponding ANCOVA
Data Analyses Related to the General Evaluation
Question(s)?

It is generally accepted that no empirical process completely
assesses an event, and evaluation is no exception. With limited resources,
especially of time, money, and personnel, an evaluation can only
address some aspects of a general evaluation question.

An evaluation is defined by the specific research questions or
hypotheses that are investigated. The selection of hypotheses to be
tested or questions to be addressed is the result of a reasoning
process that links the research hypotheses to the general question.



In an evaluation, the variables utilized are usually specified at
three different levels. First, the general area of focus, such as
program impact on participants or program effectiveness, is delineated.

(cont.)



The explication of this reasoning process, or rationale, permits the
reader of the evaluation report to identify the components
that are included as well as those that are not, in order to answer the
general evaluation question. This explication is crucial, especially
in large-scale evaluations where the inferential process relating the
overall question to the specific research hypotheses is extremely
complex and not obvious, especially to the reader.

One of the general impact questions specified by SRI (Emrick et
al., 1973) for the national FT evaluation was, "How effective is Follow
Through as a method of improving the life chances of participating
children?" (p. 72). Three research questions concerned with the
academic performance of FT pupils and attitudinal changes of their
parents and teachers were delineated to address this general question.

How these academic performance and attitudinal change variables
relate to improved life chances is not immediately apparent. A
rationale that relates them is needed to enable the reader to gain an
appropriate perspective for viewing the evaluation results. Cohen and
Garet (1975) describe one line of reasoning in their article on social
policy research:

In the late 1950's and early 1960's, for example, a national
policy concerning educational opportunity began to take
shape. It rested partly on the idea that poverty, unemployment
and delinquency resulted from the absence of particular
skills and attitudes -- reading ability, motivation to achieve


Next, the specific aspects of the area of interest that are to be
investigated are specified as the questions to be addressed. Finally, each
question is addressed by one or more statistical analyses. We are
calling these levels: (a) the general, or overall, evaluation questions,
(b) the research questions, and (c) the statistical hypotheses, which
are operationalized by the actual data analyses carried out.
in school and the like. There was also an assumption
that schools inculcated these skills and attitudes and
that acquiring them would lead to economic and
occupational success. In other words, this policy assumed
that doing well in schools led to doing well in life.
(p. 21)

By specifying the rationale used, the evaluator clarifies the
viewpoint on which the evaluation is based and enables the reader to
understand the intentions of the evaluation. Whether or not the reader
agrees with the evaluator's logic is not the important issue; we
believe that scrutiny of it is necessary for the reader to assess and

from the investigation. 

The need to specify the link between the research hypotheses and
the overall evaluation questions has been discussed. Similarly,
specifying the relationship between the statistical hypotheses actually
tested and the corresponding research hypotheses is needed. Often the
statistical hypotheses tested are not stated in the evaluation report. In
evaluation studies or analyses that are not complex, the specific
hypothesis that is tested can easily be inferred from a description of
the analysis performed. This is a much more difficult task when
multiple dependent and concomitant variables are analyzed or
numerous analyses are used to investigate each research question.

Abt Associates' national evaluation of FT (Stebbins, 1976) is a
good example of a complex evaluation utilizing numerous sophisticated
analyses. An example from this evaluation that illustrates the
problem and indicates an approach for dealing with it follows. One of the
general evaluation questions addressed in their report was, "Does
Follow Through have a greater impact on disadvantaged children than
do regular school programs?" (Stebbins, 1976, p. A-8). The impact
question was addressed by a number of ANCOVA analyses comparing
FT with local, best-match, and national NFT groups. The results are
reported in what was called Summary of Effects tables (see Table 1).



Table 1

Sample Summary of Effects Table
(Stebbins, 1976, p. A-8)

                               Site A                     Site B
                        Local  Matched  Pooled     Local  Matched  Pooled

Total Reading
Total Math
Spelling
Language
Raven's
Coopersmith
IARS (-)

Word Knowledge
Reading
Math
Math Computations
Math Problem Solving
Language Part A
Language Part B
Fifteen analyses were made for each Follow Through site reported,
one for each variable listed in Table 1. An example of a research
question might be stated as:

Is the mean reading achievement test score of
participating FT students greater than that of NFT students
when the effects of:

a. Fall kindergarten WRAT
b. First language
c. Family income
d. Highest occupation in family
e. Ethnic membership
f. Sex
g. Entry age
h. Missing data code for WRAT
i. Missing data code for income
j. Missing data code for occupation

are statistically controlled (where reading achievement
is defined as the Total Reading score of the Metropolitan
Achievement Test, which is comprised of the Word
Knowledge and Reading subtests)?

As indicated in Table 1, comparisons were made between the FT
students at each site and three different NFT groups: local, best-match,
and pooled. Nine of these comparisons deal directly with the question
of impact on reading: three each for Total Reading, Word Knowledge,
and Reading. The latter two of these are subtests that make up the
Total Reading score. Of course, none of the six comparisons involving
the Word Knowledge and Reading subtests are independent of the Total
Reading comparisons, but this is not specified in the table of effects
or associated discussion.
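
The lack of independence noted above can be made concrete with a
small sketch. It assumes, for illustration only, that a total score is
the simple sum of its two subtest scores; the scores and names below
are hypothetical and are not taken from the FT data.

```python
from statistics import mean

# Hypothetical (word_knowledge, reading) subtest scores per student.
ft_students = [(20, 15), (24, 18), (22, 21)]
nft_students = [(18, 14), (21, 16), (21, 18)]


def column(rows, index):
    """Pull one subtest score out of every student's record."""
    return [row[index] for row in rows]


def totals(rows):
    """Total score, assumed here to be the sum of the two subtests."""
    return [word_knowledge + reading for word_knowledge, reading in rows]


def mean_difference(ft_scores, nft_scores):
    """FT mean minus NFT mean for one outcome."""
    return mean(ft_scores) - mean(nft_scores)


diff_word = mean_difference(column(ft_students, 0), column(nft_students, 0))
diff_read = mean_difference(column(ft_students, 1), column(nft_students, 1))
diff_total = mean_difference(totals(ft_students), totals(nft_students))

# The total-score difference is fully determined by the subtest differences.
print(diff_word, diff_read, diff_total)
```

Under that assumption, the comparison on the total adds no information
beyond the two subtest comparisons, so the three results for one
comparison group should not be read as three separate pieces of
evidence.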
When the specific research hypothesis addressed or the statistical
hypothesis actually tested is not stated, the reader is left with the
vague impression that everything that should have been controlled
was controlled, and the numerous comparisons reported must have
assessed the FT program effects on reading rather comprehensively.
We are sure that the authors did not mean to leave that impression,
and they assumed that any sophisticated reader would interpret their
analyses and interpretations appropriately and with much caution,
given the numerous caveats and explanations included in the first part
of the report. But, in any 400-page report with an additional 400 pages
of appendices, the reader will have difficulty figuring out how the scores
that define reading were obtained, what they represent, and what the
evaluators think they represent. The same problem exists for each of
the ten or more covariates.

This is a complex and difficult problem faced by every evaluator
at one time or another, and we do not want to address issues about the
role of evaluation and of evaluation reports. Our concern is that
evaluation reports describe as clearly as possible the evaluation activities
undertaken to answer the general evaluation questions and communicate
as precisely as possible the relationships between the general
evaluation questions, the research hypotheses, and the statistical hypotheses
actually tested. Ambiguity in a massive, complex evaluation tends to
communicate to the reader that everything was done that could possibly
be done and the evaluator's conclusions are the "best" interpretations,
if not the only appropriate interpretations of the data. There are
always pressures to make the evaluation as convincing as possible, whether
positive or negative results are obtained, because the client paid for
the evaluation. This often results in gross overstatements of findings
or of the confidence one should have in the findings and often does not
represent well the situation that is being evaluated. By specifying the
general evaluation questions, the research and the statistical hypotheses,
and the evaluator's view of the relationships among them, both the
strengths and weaknesses of a complex evaluation can be clarified.
The limited empirical information presented in the resultant
evaluation report can then be used more appropriately by decision makers
and be more useful to educational professionals.

Are the Variables Defined, the Rationale
and the Procedure for Selecting the Measures
Described, and are the Relationships Among
the Measures, Variables, and Evaluation
Questions Specified?

In general, three relationships are of concern in the measurement
area: (a) variable/domain, (b) instrument/variable, and (c)
instrument/domain. Each of these has associated with it an inferential gap that
must be bridged in order to relate the empirical results to the intended
purposes of the evaluation. The rationales that delineate these
relationships must be specified in the report so the reader can best assess the
adequacy of the instrumentation.

A major issue in the measurement area is the conflicting
considerations related to the "importance" and "scope" (Stufflebeam, Foley,
Gephart, Guba, Hammond, Merriman, & Provus, 1971) of the data
collected and reported. Importance deals with emphasizing the most
important information in a particular situation and eliminating that
which is not valued. Scope is the concern about the entire range, or
comprehensiveness, of the information included in the evaluation.
Decisions must be made about each possible type of evaluative
information and datum to be included. As decisions are made about domains,
variables, and measures, practical considerations of time, money,
adequacy of measurement procedures, etc., tend to limit the
evaluation to the most important variables and measures. At the same time,
concerns about adequately fulfilling the purposes of the evaluation
tend to expand its scope. But, numerous modifications and
compromises in each area are always made.

In the national FT evaluation, two domains (cognitive and
noncognitive) were identified for student outcomes. Some of the
variables and measures used to assess the domains are listed in Table 2.

Table 2

Domains, Variables, and Measures Used in
National Evaluation of Follow Through Program(a)

Domain         Variable            Measure

Cognitive      Total Reading       Metropolitan Achievement Test
               Total Math          Metropolitan Achievement Test
               Spelling            Metropolitan Achievement Test
               Language            Metropolitan Achievement Test
               Problem Solving     Raven's Progressive Matrices

Noncognitive   Self-Concept, or    Coopersmith
               Self-Esteem
               Locus of Control    Individual Achievement
                                   Responsibility Scale

(a) These were used to assess third-grade effects in the Abt evaluation (Stebbins, 1978).



SRI identified two domains (cognitive and noncognitive) as opposed
to the three domains (basic skills, cognitive conceptual skills, and
affective) identified by Abt. In this paper, to simplify the discussion
we deal only with the cognitive/noncognitive distinction, even though
the three domains more adequately match the different sponsors'
objectives.
All measures used in an evaluation must be specified and described,
and the specific variables constructed from these measures must be
defined. The specification of the variables and the measures of them
can usually be done easily by using a table such as Table 2. When the
variables are defined as tests or subtests of standardized tests, a short
description of the test and the scores actually analyzed is usually
adequate to enable the reader to understand how each variable is being
operationally defined.

Whenever an evaluation is planned, a wide range of domains and
variables are initially identified for possible inclusion. Often domains,
variables, and measures are excluded during the selection process.
The evaluation contractor is usually most knowledgeable about the
compromises and deletions that are made. A discussion of this selection
process is seldom, if ever, included in an evaluation report. Thus, the
best thinking about this problem and the rationales for the decisions are
lost to the field and to society. They are also not available to the
readers, including major decision makers in Congress, who need that
information so that they can more appropriately assess the relative
value and importance of the conclusions of an evaluation report as they
relate to decision alternatives.

An example of such a discussion appeared in Design for the
Individualized Instruction Study (Cooley & Leinhardt, 1975b). The first
two pages of their rationale for excluding noncognitive variables in
their evaluation design are included as Appendix B of this paper. It
indicates the steps that were followed and the criteria they used to
arrive at their [...]. In Section 3 of their report, Cooley
and Leinhardt present their rationale for using a standardized
achievement test to assess cognitive outcomes. The criteria utilized to compare
possible tests are delineated. The actual test reviews are included in
an appendix of their report, where the subtests of each achievement
battery, the psychometric characteristics, the norms available, and
other characteristics are described. However, there is very little
discussion by the authors of the inadequacies of the test battery that
was to be used in the evaluation.

This example gives a rationale for and describes the relevant
relationships between the domains, variables, and measures to be
included in this evaluation. In our view, it would have been helpful to
indicate more fully the strengths and inadequacies of the achievement
battery in assessing the specific variables and the cognitive, or achieve 

Since neither Abt nor SRI described the procedures and rationales
that led to delineating the variables and measures utilized in the FT
evaluations, we have identified some aspects that seem to have been
considered.

1. Follow Through is an attempt to extend the positive effects of
the Head Start program. Variables similar to those investigated in
the Head Start evaluation should be included.

2. Follow Through as a compensatory education program has as
its primary emphasis the improvement of students' basic skills, which
in the first three grades are reading and mathematics.

3. The 22 FT sponsors have discernibly different approaches to
early childhood education. Domains and variables were identified
from their main program objectives.

4. Given the amount of time and money allocated for the FT
evaluation, only the most important and usable variables could be
investigated. Thus, some important variables that are of interest
could not be included because valid and reliable measures of them
were not available.

A discussion of each of the considerations used to make decisions
and judgements on the adequacy of the domains and variables is needed
if the complex analyses and interpretations are to be meaningfully
understood and utilized. The evaluation report should indicate how
these and other considerations affected variable selection and should
include the rationales for the choices made. When these
considerations are not included in a large complex evaluation, the reader is
often left with the impression that all important domains and
variables were included in the evaluation, and the measures used did
adequately (and comprehensively) represent them.

What Criteria Were Utilized to Decide if a
Specific ANCOVA Analysis Should be Made
and Interpreted?

1. To what extent does each comparison meet these criteria?

2. What are the effects on the interpretation of results of the
failure to meet the criteria?

The primary reason that ANCOVA-type procedures are used in
evaluation settings is to adjust for, or statistically control, other
likely explanations for the outcomes that are assessed. But,
alternative explanations are still present after the analyses have been
completed, whether or not randomization was used. When there is little
or no experimental control, such as in naturalistic field studies or
evaluations, outcomes are even more difficult to interpret and explain.

In naturalistic field settings, such as FT, deviations from the
assumptions and conditions necessary to apply and interpret ANCOVA
with some degree of precision, such as homogeneity of regression or
similarity of groups in the analysis, are often present. The evaluator
must attempt to assess the extent of the deviations and to delineate
criteria for deciding when a specific analysis should not be interpreted.
Establishing the specific values for criteria is an admittedly subjective
process, as there is little guidance available in the literature. These
values must be based on the purposes of the evaluation and on the
specific situations in which the evaluation is occurring. In this section,
we discuss several criteria that should be considered, and present
methods of reporting them.

The first consideration that must be made, especially in a
longitudinal evaluation, is whether the data in hand are representative of
the situation being evaluated. In the FT evaluation, non-random
attrition was often a major problem. After three to four years, less than
[...] percent of the initial FT sample had complete data in some sites.
A criterion that was implemented by the Abt evaluation team was that
both the FT and NFT groups be comprised of at least 12 students. This
small sample size would, of course, overfit the statistical model,
especially when seven to ten covariates were used; but, at least the
criterion value (12) was explicitly stated. The actual sample sizes of
the samples included in specific analyses can usually be presented in
a table summarizing the results (or "effects," in Abt's terminology).
In addition to sample size, this table should report what proportion of
each site's participating students comprise its sample. The size of
this proportion directly influences the appropriateness of conclusions
drawn from the analyses. The question of proportion of participants
needed is a related but complex concern that will not be addressed
here, but should be considered in each evaluation setting. Other
considerations about the representativeness of the data that might be of
concern are discussed in the [...] section of this paper.
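
The overfitting concern raised above can be illustrated with simple
arithmetic. The sketch below assumes a two-group ANCOVA with one
intercept, one group indicator, and a single common slope per
covariate, with no collinear covariates; the function name is
hypothetical.

```python
def residual_df(n_ft, n_nft, n_covariates):
    """Residual degrees of freedom for a two-group, common-slope ANCOVA."""
    # Parameters estimated: intercept + group indicator + one slope per covariate.
    n_parameters = 2 + n_covariates
    return n_ft + n_nft - n_parameters


# Minimum group size of 12 per group, with seven and with ten covariates.
for n_covariates in (7, 10):
    n_total = 12 + 12
    df = residual_df(12, 12, n_covariates)
    per_parameter = n_total / (2 + n_covariates)
    print(n_covariates, "covariates:", df, "residual df,",
          round(per_parameter, 2), "students per estimated parameter")
```

At the minimum group size, only about two to three students are
available per estimated parameter under these assumptions, which
suggests that the adjusted differences for such sites would be quite
unstable.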



Implicit in much of the literature on ANCOVA is that it should
not be used in non-experimental situations, but this is extreme.
Selective application and cautious interpretation are a more practical
and useful approach to using this and other statistical methods.
A second set of criteria are the assumptions of ANCOVA. The
four considerations that are usually important were identified in Abt's
FT report.

1. The cova^iates arc xininflnenced by treatment; 

2. The distribution of the covariates is not £:rossly dij^erent 
across ^groups; 

3. The relationships between covariates and criteria are 
the sazxie (honaogeneous); and, 

4- The covariates are perfectly reliable* (Stebbins, 1976, 
p. A-58) 

Each of these assumptions was investigated or discussed in Abt's
evaluation report. The first assumption was investigated by rerunning
ANCOVA comparisons without the one covariate (WRAT) that they felt
could be influenced by the treatment. Violation of this assumption
meant that the portion of the treatment effect that was confounded with
WRAT was being inappropriately removed. The report states:

    If the WRAT is influenced by the first few weeks of treat-
    ment, one might expect pretest adjustments to handicap
    the FT children. To test this we removed the WRAT
    from the covariate set and reran the local analyses.
    The results of these "no-WRAT covariate" analyses do
    not differ in any important ways from analyses which in-
    cluded WRAT as a covariate. We conclude from this
    comparison that the WRAT is probably not hindering our
    analyses of program effects. (Stebbins, 1976, p. A-59)

The rerunning of the ANCOVA analyses for each site was useful
in addressing whether the use of the available WRAT data as a co-
variate affected the results obtained and conclusions reported. It is
important to note that dropping any one of ten interrelated covariates
is unlikely to affect the results of an analysis, given the relatively
high correlations among covariates. Dropping the WRAT does not
directly address the assumption that the covariates were not influ-
enced by treatment. This assumption could be addressed more directly
by giving the pretest earlier, such as in the previous year, before the
program was implemented, or in the first two weeks of program imple-
mentation. Or, it could be addressed in a pilot study before the evalu-
ation is undertaken.

Our own experiences at the Learning Research and Development
Center (LRDC) have convinced us that much learning, as assessed by
paper and pencil achievement tests, occurs in the first few weeks of
our program. In a study in Pittsburgh area schools (Eichelberger,
DiCostanzo, & Evaluation Staff, 1975), students using the LRDC cur-
ricula, similar to that used in FT, were assessed in the sixth week of
school (as were a group of similar students) using the Metropolitan
Readiness Test (MRT). The results obtained are reported in Table 3.



Table 3

Fall Metropolitan Readiness Test Results for
Kindergarten Students in LRDC and Comparison Schools

                         Fall Mean       N

LRDC School 1            36.[...]        59
Comparison School 1      28.32           63
LRDC School 2            [...]           42
Comparison School 2      22.69           42
LRDC School 3            29.32           87
Comparison School 3      23.03          130

These results indicate that students in the three schools using the LRDC
curricula during the first six weeks of kindergarten scored much higher
than similar students who had not used those curricula. These results
suggest problems with the assumption that the fall kindergarten WRAT
scores were unaffected by six weeks of treatment.

If the pretest was differentially affected by the treatments in FT,
use of the ANCOVA-like procedures to test other assumptions and to
adjust FT and NFT group differences in that evaluation might result
in inappropriate conclusions. Elashoff (1969) suggested that analysis
of variance be conducted on the covariate to test the assumption, but
in the situation where testing occurs after four to six weeks of school,
that procedure does not directly address the issue of the comparability
of the groups prior to treatment. The implications of the failure to
meet the assumption have not been well delineated at this time and
deserve more careful consideration by evaluators using this technique.

If the evaluator's concern is to assess the effect of using a specific
covariate (such as the WRAT) on the results obtained, then rerunning
the analyses (without the WRAT) is useful. Whenever the "no-WRAT
covariate" analyses result in changes in conclusions for a specific
comparison, that fact and the associated results should be reported.
The two sets of ANCOVA analyses might be performed at some specified
level of significance (such as .05) and presented in a way that would
reflect the different results that were obtained. When a large number
of analyses and reanalyses are made, it is, of course, important to
note the number of differences found significant as a proportion of the
total number of comparisons made within each program, or site.
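The proportion described above is easy to tabulate. The sketch below is illustrative only: the p-values are invented, the function name is hypothetical, and the "expected by chance" figure assumes that every null hypothesis is true, which is a simplification offered for orientation rather than a formal correction for multiple comparisons.

```python
def significance_summary(p_values, alpha=0.05):
    """Summarize how many comparisons reached significance at `alpha`."""
    n = len(p_values)
    n_sig = sum(1 for p in p_values if p < alpha)
    return {
        "n_comparisons": n,
        "n_significant": n_sig,
        "proportion_significant": n_sig / n,
        # Expected count of significant results if all nulls were true.
        "expected_by_chance": alpha * n,
    }


# Hypothetical p-values from ten comparisons within one site.
site_p_values = [0.01, 0.20, 0.04, 0.50, 0.03, 0.70, 0.06, 0.90, 0.15, 0.30]
print(significance_summary(site_p_values))
```

Reporting the count of significant differences next to the total number of comparisons lets the reader judge how far the observed count exceeds what chance alone might produce.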

The second assumption we will discuss, that the covariates were
measured without error, has been theoretically studied, but what is
known has seldom been applied in evaluation studies. Conclusions about
educational programs drawn from empirical data may not represent
the situation because of failure to meet this assumption. Evaluators
often attempt to assess sampling error in a specific study, which is
often estimated from repeated use of the same measurement proce-
dure. The conclusions may also be misleading about specific varia-
bles, such as academic achievement, because the measures inade-
quately assess important aspects of the variables. Neither of these
problems can be solved with great confidence in an applied setting,
so the complex adjustments are not attempted and the inadequacies
in measuring the variables are overlooked.

There is also a tendency to overlook what Coleman, Campbell,
Hobson, McPartland, Mood, Weinfeld, and York (1966) called meas-
urement error. Measurement error "... includes such errors,
among others, as ambiguities in definitions and in the questionnaire,
failure to obtain required information from respondents, obtaining
inconsistent information, mistakes in clerical coding and editing,
errors occurring during the machine processing operation, and tabu-
lation errors" (p. 561). In other words, it cannot be assumed that
demographic and other "concrete" descriptive data are measured
without error.

In Abt's FT evaluation, the authors indicated that "Variables
such as sex, ethnicity, income, occupation, education, language,
and age are all measured with a minimum of error. It is only the
pretest which poses a problem" (Stebbins, 1976, p. A-60). With
problems that exist in most self-report data, especially about
variables like income and occupation among the low SES group, it is
important that estimates of the reliability of these data be obtained
and reported when they are used as covariates (see Elashoff, 1969;
Lord, 1962).

An illustration of the difficulties that often arise in measuring
what seem to be absolute entities occurred in a study at LRDC. The
size of each classroom utilizing the LRDC program was to be measured
by making a sketch of the classroom area with the dimensions speci-
fied. Usually, this was done only once, because we felt we could
reasonably assume that it was measured with a minimum of error.
On one occasion, the measurement of the classrooms was asked for
again in the same year. The results were not at all consistent, with
shapes as well as dimensions changing. This experience has made
us extremely cautious about the accuracy of all types of data, regard-
less of their presumed simplicity.

Coleman and his colleagues (1966), in the "Equality of Educational
Opportunity" study, empirically investigated the systematic measure-
ment error that resulted from selected parts of their procedures.
Evaluators of all major longitudinal studies should consider estimat-
ing and reporting the measurement error associated with their data.

In Abt's FT evaluation, the degree of error in the WRAT pretest
was investigated. The report stated that: "The reliability of the pre-
test was calculated by each Follow Through Sponsor-level sample by
a measure of internal consistency (coefficient alpha) and is on the
order of .90 across these samples" (Stebbins, 1976, p. A-60).

It is, of course, important to report the specific values for each
group on which the analysis is run, because when the value is low for
a specific group, the conclusions drawn from the analyses must be
interpreted with even more caution. Even though 90% of the groups
have very high reliability, specific sites may fall within a large
range of values. The reader needs to know these values for specific
analyses. It would be helpful if the evaluator initially set a reliability
level (such as .80) below which the covariate would not be used (see
Glass, Peckham, & Sanders, 1972, for reviews of studies that inves-
tigated this concern). This does not preclude a later decision to include
a covariate that does not meet the criterion value, if there are unique
compelling reasons to do so. The reliabilities and associated cau-
tions should be reported, or at least noted, in the text where the
conclusions that they affect are reported.
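A site-by-site reliability check of this kind can be sketched as follows. This is a minimal illustration, assuming item-level pretest scores are available for each site and that coefficient alpha is computed with the standard formula, k/(k-1) times one minus the ratio of summed item variances to the variance of total scores. The function names, the site labels, and the data are hypothetical; the .80 cutoff is simply the example value mentioned in the text.

```python
from statistics import variance


def coefficient_alpha(item_scores):
    """Coefficient alpha for rows of students and columns of items."""
    k = len(item_scores[0])
    item_var_sum = sum(variance(col) for col in zip(*item_scores))
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)


def reliability_report(site_items, criterion=0.80):
    """Return {site: (alpha, meets_criterion)} for each site's item data."""
    report = {}
    for site, items in site_items.items():
        alpha = coefficient_alpha(items)
        report[site] = (round(alpha, 3), alpha >= criterion)
    return report


# Hypothetical item scores: four students by two items at one site.
sites = {"Site A": [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 5.0]]}
print(reliability_report(sites))
```

A table built from such a report would show each site's value next to the effects it bears on, so that a low reliability at one site is not hidden behind a high average across sites.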

A related point that must be raised at this time is the question-
able use of coefficient alpha as an estimate of error in a covariate,
such as the WRAT. This is not intended as criticism of Abt's use
of it, given the type of data available to them and their general situa-
tion. In fact, Abt's attempts to investigate the adequacy of the FT
data for the ANCOVA model are to be commended. But, it is impor-
tant that better methods be identified and utilized for testing the
assumptions and setting appropriate criteria. Tukey (1954) and
Wold (1956) explicated problems that arise as data analysis moves
from experimental to observational data, of which every researcher
must be aware. Additional work on these problems is needed for
clarifying their implications for decision-oriented research.

As previously indicated, no criteria related to the first and
fourth assumptions were specified and used in the national FT evalua-
tion, although both assumptions were addressed. The three criteria
that the Abt evaluation used and reported to indicate that "the adjust-
ment produced by ANCOVA may be misleading" (Stebbins, 1976,
p. A-71) were:

1. when the relationship between a given covariate and
outcome is different for the treatment and compari-
son groups being analyzed;

2. when the pretest difference between the treatment
and comparison groups being analyzed is greater
than five points (about one-half of a standard devia-
tion); and

3. when the percent of those attending preschool in
each group differed by more than 50 percent.
(p. A-72)
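Explicit criteria of this kind lend themselves to a mechanical check. The sketch below is a hypothetical illustration, not Abt's procedure: it assumes that the outcome of the test for unequal slopes is already available as a yes/no result, and it reads "differed by more than 50 percent" as a gap of more than 50 percentage points, which is an interpretive assumption. All names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    slopes_differ: bool        # outcome of a test for unequal regression slopes
    pretest_diff: float        # FT mean minus NFT mean on the pretest, in points
    pct_preschool_ft: float    # percent of FT group that attended preschool
    pct_preschool_nft: float   # percent of NFT group that attended preschool


def criteria_violated(c, max_pretest_diff=5.0, max_preschool_gap=50.0):
    """Return the numbers (1-3) of the criteria that a comparison violates."""
    violated = []
    if c.slopes_differ:
        violated.append(1)
    if abs(c.pretest_diff) > max_pretest_diff:
        violated.append(2)
    if abs(c.pct_preschool_ft - c.pct_preschool_nft) > max_preschool_gap:
        violated.append(3)
    return violated


# A hypothetical comparison with a large pretest gap and a large preschool gap.
print(criteria_violated(Comparison(False, 6.2, 80.0, 20.0)))
```

Keeping the cutoffs as named parameters makes the decision rule visible, so a reader who disagrees with a value can see exactly which comparisons would change status.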
The first criterion is essentially the ANCOVA assumption that the
treatment groups have a common regression surface. The second is
an indication that the treatment groups were not drawn from populations
with the same covariate distributions. The third criterion could also
be viewed as questioning the assumption that the groups were initially
similar (see Campbell & Erlebacher, 1970, for how this can lead to
erroneous conclusions). When any of these conditions existed for a
comparison, or set of comparisons, their corresponding results were
"greyed-out" in the effects tables. The specific criterion values that
were used in greying-out a particular comparison are presented in the
text of the report, and which of these were violated in an analysis is
specified in the effects tables. This is vastly superior to presenting
all information in a table with no indication that some of the results
are questionable. In fact, in certain situations it may be more appro-
priate to report only the results of analyses that do meet all of the
minimum criteria, rather than merely greying-out certain results,
since a reader tends to assume that all data reported are meaningful.

Note that these three criteria have associated with them explicit
values or decision rules (such as statistical significance). Although
the reader may disagree with the specific values set by the evaluator,
s/he knows when the evaluator thinks the results are interpretable and
what the specific criteria are on which the decisions are based. Testing
for the violation of the conditions and assumptions needed for meaning-
ful interpretation of ANCOVA results should be done in an experimental
setting; however, such testing is imperative in an evaluation or natu-
ralistic setting, where naturally confounded variables are almost
certain to be present, and little control of the situation is possible.

Considerations should not be limited, however, to those discussed
and dealt with in this paper, or in the Abt report. Others that may be
relevant in your setting may have been excluded in the FT evaluation.
For example, can the criterion-covariate regression be expressed in
linear form for each treatment group? If this condition is violated,
i.e., the data cannot be transformed to linear form, comparison of
two estimated treatment means will be biased. Elashoff (1969) notes
"that the effect of nonlinearity is most severe when random assign-
ment to groups is not possible or protection against non-normality in
the y's is lowest" (pp. 390-391). This condition can be tested by
examining increases in explained variation when higher order terms
are included in the regression equation. Again, the evaluator should
specify the exact criterion value for the test.
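The test described above can be sketched for a single covariate within one treatment group. This is a minimal illustration under stated assumptions: only a quadratic term is considered, the regressions are fitted by ordinary least squares through the normal equations, and the F ratio has 1 and n - 3 degrees of freedom. The helper names and the data are hypothetical, and the cutoff against which the gain or the F ratio is judged is left to the evaluator.

```python
def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def residual_ss(x, y, degree):
    """Residual sum of squares from a polynomial regression of y on x."""
    cols = [[xi ** p for xi in x] for p in range(degree + 1)]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def quadratic_term_test(x, y):
    """Return (gain in R-squared, F ratio) from adding a squared covariate."""
    n = len(y)
    mean_y = sum(y) / n
    total_ss = sum((yi - mean_y) ** 2 for yi in y)
    sse_linear = residual_ss(x, y, 1)
    sse_quadratic = residual_ss(x, y, 2)
    r2_gain = (sse_linear - sse_quadratic) / total_ss
    f_ratio = (sse_linear - sse_quadratic) / (sse_quadratic / (n - 3))
    return r2_gain, f_ratio


# Hypothetical covariate and outcome scores with visible curvature.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [xi ** 2 + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x)]
print(quadratic_term_test(x, y))
```

In a report, the gain in explained variation and the criterion it was compared with would be stated for each group, in the same way as the other criteria discussed in this section.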

In summary, the evaluator of any program working in a natural-
istic setting will find that the data available will deviate from assump-
tions and conditions necessary for direct interpretation of ANCOVA
results. If this fact is recognized before the evaluation is implemented,
detailed guidelines can be designed based on the purpose of the evalua-
tion and the evaluation setting. As more data and knowledge are gained
from a program, especially a longitudinal one, modification in these
criteria may be necessary. But these guidelines, with their associated
rationales, their subsequent changes, and their implications for the
conclusions drawn, are needed by the reader to understand and interpret
the results of an evaluation.

What Criteria Were Utilized to Determine 
If a Covariate Was to Be Included in a 
Specific Analysis?

Too often covariates are included indiscriminately in a set of
ANCOVA analyses without knowledge of the local conditions or a theory
of how the variables interrelate (see Cooley & Lohnes, 1976, for a
discussion of the importance of theory in evaluative research). This
usually results in conservative estimates of treatment effects, due to
confounding of the covariates with the treatment. We believe that the
selection of covariates for an analysis should be based on a logical
rationale, preferably one that is a part of a broader theoretical frame-
work. Presenting the logical process used to identify candidate
covariates indicates that the evaluator has broadly conceptualized the
evaluation problem. Also, unique conditions in a specific situation
often require decisions to be made about the inclusion or exclusion of
a covariate in a specific analysis. In addition to a theoretical basis
for including covariates, guidelines for excluding them are needed.
These guidelines for including or excluding covariates are usually
based on the several assumptions of ANCOVA listed in the previous
section.

All covariates that were assessed in the evaluation and considered
for use in a specific analysis should be listed in the report. The ration-
ales for their inclusion should also be presented. This point was dis-
cussed extensively under variable identification and measurement in
section two of this paper.

In large complex evaluations, numerous covariates are assessed,
but the specific ones used in different analyses often vary. This vari-
ation is usually due to failure to meet one of the ANCOVA assumptions,
or to conditions unique to the situation, such as only one race involved.
When the results of an ANCOVA analysis are presented, a list of the
covariates considered for use and the reason for excluding any from
the analysis should be reported. A hypothetical example of such a
list is presented in Table 4.

The criteria to decide whether or not a covariate should be included
in a comparison should not be limited to the assumptions of ANCOVA.
For example, one criterion not utilized in the Abt report was the degree
of relationship between the covariate and outcome variable. This is
an important factor for assessing the effectiveness of the covariate.
Table 4

Hypothetical Table of Covariates Considered for an
ANCOVA Analysis and Reasons for Dropping Those That Were Excluded

                           Criteria Failed    Covariates Included

Fall Kindergarten WRAT     1*, 4
Preschool Experience       2
Sex                                           X
Ethnic Membership                             X
Occupation                 4

*The numbers refer to the assumptions of ANCOVA listed in section 3.



Cox (1957) compared the precision of blocking versus covariance for
different values of rho (ρ), i.e., the correlation between the covari-
ate and outcome variable. Cox concluded that if ρ < .4, blocking
is preferable to covariance analysis; if .6 < ρ < .8, covariance is
somewhat better; and if ρ > .8, covariance analysis is appreciably
better. Although other factors affect this relationship (e.g., shape
of covariate distributions), it is an important consideration that
should be used as a criterion for judging potential covariates. This
consideration is, of course, secondary to the purpose of the evalua-
tion and the specific questions being addressed in any study.
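The ranges attributed to Cox above can be written as a simple lookup. This is a hypothetical helper, not part of any cited report; it uses the absolute value of the correlation (an assumption) and deliberately returns no recommendation for values that fall outside the ranges quoted in the text.

```python
def cox_guideline(rho):
    """Map a covariate-outcome correlation to the guideline quoted above."""
    r = abs(rho)  # assumption: only the strength of the relationship matters
    if r < 0.4:
        return "blocking preferable"
    if 0.6 < r < 0.8:
        return "covariance somewhat better"
    if r > 0.8:
        return "covariance appreciably better"
    return "not covered by the quoted ranges"


# Hypothetical correlations for three candidate covariates.
for name, rho in [("pretest", 0.85), ("income", 0.30), ("preschool", 0.50)]:
    print(name, rho, cox_guideline(rho))
```

Listing the correlation and the resulting category for each candidate covariate gives the reader one more explicit criterion alongside the assumption checks.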

The general information on choosing covariates that would appear
in a table similar to Table 4 should be supplemented with the specific
values of correlation coefficients for the selected covariates. This
could be efficiently incorporated into a table containing the information
necessary for the reader to reconstruct the regression equations of the
comparisons that were actually made. A pertinent illustration can be
found in the appendices of the SRI report (Emrick et al., 1973). In
Table 5, the correlation with the dependent variable, the raw regression
weight, the standardized regression weight, and the standard error of
the regression coefficient are listed for each covariate.

Table 5

Cohort I, Kindergarten, Regression on Control Variables:
Pupil Outcomes for Project Analyses (N = 330, Residual df = 2581)

This should
be accompanied by information describing and elaborating on the
ANCOVA comparison made, such as unadjusted and adjusted outcome
variable means, the standard error of the adjusted difference, the
sample sizes, and the actual results of the comparison in the form of
a computed statistic or confidence interval.
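Several of the quantities listed above follow directly from the ANCOVA model. The sketch below is a minimal illustration for a single covariate, assuming a common within-group slope and adjustment of each group's outcome mean to the grand covariate mean. The function names and the scores are hypothetical, and the standard error of the adjusted difference is not computed here.

```python
def mean(values):
    return sum(values) / len(values)


def ancova_adjusted_means(groups):
    """groups maps a label to (covariate scores, outcome scores).

    Returns the pooled within-group slope and, for each group, the sample
    size, covariate mean, unadjusted outcome mean, and adjusted outcome mean.
    """
    grand_x = mean([xi for x, _ in groups.values() for xi in x])
    sxx = sxy = 0.0
    for x, y in groups.values():
        mx, my = mean(x), mean(y)
        sxx += sum((xi - mx) ** 2 for xi in x)
        sxy += sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    summary = {}
    for label, (x, y) in groups.items():
        summary[label] = {
            "n": len(y),
            "covariate_mean": mean(x),
            "unadjusted_mean": mean(y),
            "adjusted_mean": mean(y) - slope * (mean(x) - grand_x),
        }
    return slope, summary


# Hypothetical pretest (covariate) and posttest (outcome) scores.
data = {
    "FT": ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    "NFT": ([3.0, 4.0, 5.0], [5.0, 7.0, 9.0]),
}
slope, summary = ancova_adjusted_means(data)
print(slope)
for label, row in summary.items():
    print(label, row)
```

In this invented example the unadjusted means favor the NFT group while the adjusted means favor the FT group, which illustrates why both sets of means, together with the covariate means, need to appear in the report.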

To summarize, we have recommended that the evaluator delineate
a logical rationale for the selection of variables as candidate covariates,
preferably based on an overall theoretical framework. Guidelines
should be specified for deciding when covariates should not be included
in an analysis. The criteria specified should include, but not be
limited to, the assumptions of ANCOVA. The results of this decision
process could be presented in a table similar to Table 4. Finally, the
types of data that should be reported for each covariate and comparison
that allow the reader to assess the interpretations derived from the
ANCOVA comparisons have been discussed. In large complex eval-
uation reports, we have often found it very difficult to identify the
variables that were included in an analysis, let alone find the reasons
why a particular variable was or was not included. Criteria that might
be used to decide which covariates to exclude from an analysis, and
methods for clearly presenting the associated information in an eval-
uation report, need further investigation and development.



5 

Some indications that the adjusted means have no intrinsic value,
and that comparison of the means with their associated unadjusted
means is not usually meaningful, should be included in the report. Of
course, the difference between the adjusted means is used in the com-
putation of statistical significance.
Are the Different Groups Included in the
ANCOVA Analysis and Their Educational
Experiences Described Adequately in the
Report?

Much of the technical information required to assess the appro-
priateness of the interpretation of the evaluation results has been
specified in the preceding sections. We have commented on the need
for the evaluator to: (a) link the data analyses and research hypothe-
ses to the general evaluation questions, (b) specify and link the mea-
sures and variables utilized in an evaluation to the domains of interest,
(c) state the criteria used to decide if an analysis should be made and
reported, and (d) specify the criteria used to select covariates for
a specific analysis. In addition to these concerns, several others
pertaining to the groups' characteristics and experiences must be
addressed for the conclusions to be interpreted appropriately.

Knowledge of the educational conditions and treatments that the
different groups experienced is of central importance in interpreting
the results of any program evaluation. A detailed description of the
programs experienced by students is a major undertaking, as evidenced
by the extensive work in FT of Stallings (1973) at SRI and of Cooley
and Leinhardt (1975a) at the Learning Research and Development
Center. Obviously, the extensiveness of the program descriptions
represented by these studies usually cannot be achieved when conduct-
ing an impact evaluation, but the identification of some essential
context and program variables should be made by evaluators in any
setting. Ignoring differences between intended treatments and those
actually experienced, and between characteristics of the "experimental"
and "comparison" groups, can lead to erroneous conclusions about the
relative impact of the variables being evaluated. Also, little or no
knowledge is gained about how the obtained outcomes were affected
by important program variables.
Follow Through again provides a relevant example. The FT
evaluation was intended to assess the impact of the FT program on
participating children, as compared to the impact of "regular" school
experiences that did not include innovative educational programs.
However, the "regular" school programs serving the NFT comparison
children often included other compensatory programs, such as Title I,
which at times utilized educational materials and practices similar to
those in some Sponsors' FT instructional models. As a result, when
an FT/NFT comparison is made, the appropriate interpretation of the
results is not immediately apparent. Differences between FT and NFT
groups and their educational experiences must be integrated with the
reporting of results. A hypothetical example might be, "The NFT
children at the Oshkosh, Alaska, site were similar to the participating
FT children at the site on all entry characteristics measured. Because
the NFT children were from families whose incomes were very low,
they qualified and participated in the Title I federal compensatory edu-
cation program. This involved supplemental instruction in arithmetic
and reading and additional aid. . . ." This type of information is needed
by the reader to interpret the results with respect to the educational
variables actually being assessed and the degree to which differences
in outcomes might be expected.

In addition to considerations about the comparability of the educa-
tional conditions and materials the different groups experience, the
evaluator must report information about the similarities or differences
between the groups experiencing the program being evaluated and those
comprising the comparison group. In previous sections, the necessity
to report raw and adjusted means on the covariates and the dependent
variables was noted. Suggestions of how and where to report the in-
formation were also indicated. Other aspects of each unique evaluation
setting must also be taken into account. Within FT, some of these con-
siderations relate to attrition and missing data, program requirements
for participation, and local implementation and utilization of the pro-
gram.


Attrition and missing data commonly affect the final composition
of the groups being compared. Attrition occurs when a participating
student moves out of the FT classroom. Missing data occurs when
one or more measurements for a participating student are missing.
Due to these two factors, the composition of groups in the FT evalua-
tion has been shown to undergo drastic changes during the course of
a four-year educational program. For example, the Abt report states
that "approximately 50 percent of the FT and NFT children who are
tested in the kindergarten year of Cohort II were not present at the
end of third grade" (Stebbins, 1976, p. A-47).

An empirical investigation can be utilized to determine whether
attrition or missing data bias a comparison. The Abt evaluators
compared attrition rates for FT and NFT students at each site, using
their pretest scores and family income data. Five sites were found for
which attrition significantly changed the difference between groups'
pretest scores, and three sites were identified for which attrition
altered the FT/NFT difference in mean income. No explanation was
given in the report for the selection or limitation of the investigation
to these two variables.
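One simple way to examine whether attrition changes a group difference is to compare the FT/NFT difference on an entry variable for all initially tested students with the same difference for the students who remained. The sketch below is a hypothetical illustration with invented records and names; it is not a description of the procedure used in the Abt report, and it does not include a significance test.

```python
def mean(values):
    return sum(values) / len(values)


def ft_nft_difference(records, key, retained_only=False):
    """FT mean minus NFT mean on `key`, optionally for retained students only."""
    def group_mean(group):
        values = [r[key] for r in records
                  if r["group"] == group and (r["retained"] or not retained_only)]
        return mean(values)
    return group_mean("FT") - group_mean("NFT")


def attrition_shift(records, key):
    """Return (initial difference, retained difference, change due to attrition)."""
    before = ft_nft_difference(records, key)
    after = ft_nft_difference(records, key, retained_only=True)
    return before, after, after - before


# Hypothetical student records from one site.
records = [
    {"group": "FT", "pretest": 20.0, "retained": True},
    {"group": "FT", "pretest": 30.0, "retained": False},
    {"group": "NFT", "pretest": 25.0, "retained": True},
    {"group": "NFT", "pretest": 25.0, "retained": True},
]
print(attrition_shift(records, "pretest"))
```

The same comparison could be repeated for every covariate, not only pretest and income, and the variables examined and the reasons for choosing them stated in the report.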

A procedure was used in the Abt report to estimate values for
the missing data for covariates. Whether or not a covariate value
was estimated was then noted in the analysis. Several advantages to
this procedure were noted in the Abt report:

    It avoids the risk of nonrepresentativeness due to dropping
    children;

    it avoids the loss of statistical power due to reduced sample
    size;

    it uses the information contained in the absence/presence
    of the variable;

    and it uses the information present on other variables for
    children who might have been dropped otherwise. (Stebbins,
    1976, p. A-51)

In any large-scale longitudinal evaluation, the evaluator will have
the task of selecting from numerous alternative approaches for han-
dling missing data, including dropping such persons from all analyses.
Each situation will dictate considerations that will influence the deci-
sion rules for handling missing data. We suggest that these rules and
their rationales be made explicit. How the estimation of missing data
affects the assumption that the measures are perfectly reliable and
how the interpretation of the results might be affected must be con-
sidered.
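An explicit decision rule for missing covariate values might look like the following sketch. Mean substitution is used here only as a stand-in, since the estimation procedure in the Abt report is not described in this paper; the function name and the scores are hypothetical. The flag recording which values were estimated corresponds to the practice, noted above, of carrying that information into the analysis.

```python
def impute_with_flag(values):
    """Fill missing covariate values (None) with the observed mean.

    Returns the filled list and a parallel list of flags marking which
    values were estimated rather than observed.
    """
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    filled = [fill if v is None else v for v in values]
    was_estimated = [v is None for v in values]
    return filled, was_estimated


# Hypothetical pretest scores with one missing value.
scores, estimated = impute_with_flag([10.0, None, 14.0])
print(scores, estimated)
```

Whatever rule is adopted, stating it in this explicit form makes it possible to report how many values were estimated for each covariate and to rerun an analysis with those cases dropped.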

Federal requirements for the FT program also affected the com-
position of the FT and NFT groups:

    Children enrolled in early elementary grades may par-
    ticipate in [FT] projects. . . . At least 50 percent of
    the children in each entering class shall be children
    who have previously participated in a full-year Head
    Start or similar quality preschool program and who
    were low income at the time of enrollment in such
    preschool program. (Federal Register, 1975,
    pp. 11714-11715)

As a result, entering kindergarten children could not be randomly
selected for participation in FT. At some sites, those students
below poverty level were assigned to the FT classroom while stu-
dents from higher income families were assigned to the regular
classrooms and often became part of the NFT comparison groups
at the site. The descriptive data do indicate the existence of this
systematic bias caused by program requirements (see Table 6).

Table 6

Descriptive Characteristics of the National Population,
the Follow Through Sample, and the Non-Follow Through Sample

                     National      FT         NFT

Income               $9590         $4460      $8060
% Minorities         13%           86%        79%
% Preschool          9%            81%        67%
Fall WRAT            NA            29.7       29.4

Local decisions about implementation and utilization of the FT
program are more difficult to document, but no less a problem for
adequate interpretation of results. For example, local administra-
tors often used FT as a remedial program. Students who were
repeating a grade or who had special needs were often placed in the
FT classroom. The Abt report also indicated that a systematic bias
exists against the FT group: "In most cases the Follow Through par-
ticipants were selected from among the 'most difficult' in the com-
munity . . . some communities chose to include the mentally handi-
capped and/or emotionally disturbed" (Stebbins, 1976, pp. A-12 - A-13).

Although this was not the case at all sites, it does document that
the "more difficult" students were placed in FT classes. The effects
of these differences, such as the inclusion of emotionally disturbed
children and grade repeaters, usually remain unknown because they
are not assessed by the covariates and are not investigated in other
ways either.

These examples emphasize the need for detailed descriptions of
the groups being compared and their educational experiences. This
information should be coordinated with the reporting of results at the
site level, since program interpretations depend upon the similarity
of the groups. The results section of the Abt report did describe
FT/NFT group comparability both in tabular and prose forms. For
example, the FT and NFT groups at a particular site were described
in the results section for that site:

    The FT group is also well below sponsor average in
    income, while the NFT is about average for this spon-
    sor. . . . The two groups are a fairly close match on
    entry WRAT and ethnic composition, though the NFT
    income level is considerably higher than the FT level.
    (Stebbins, 1976, p. A-195)

This information permits the reader of the report to make more
appropriate interpretations of conclusions and other summary state-
ments made by the evaluator about a specific site by making him/her
aware of the similarities in entry WRAT and ethnic composition and
the considerable difference in income.

Summary 

The purpose of this paper was to indicate some specific informa-
tion that should be included in an evaluation report when ANCOVA-type
techniques are used in order to allow the reader to assess the adequacy
of the analyses and the appropriateness of the evaluator's interpreta-
tions of the results. A recurrent theme of this paper has been that
the evaluator must recognize that the application of ANCOVA in evalua-
tion settings requires a more elaborate analysis and reporting strategy
than in experimental studies due to the failure to meet assumptions
of ANCOVA precisely and the existence of numerous plausible alterna-
tive interpretations of the results. The evaluator must recognize that
important aspects of the evaluation should be described in detail, i.e.,
setting, treatments, characteristics of the participants and nonpartici-
pants, and their educational experiences.

The major points elaborated in the paper are summarized below. 
Those activities that evaluators often fail to carry out, or criteria 
that are sometimes not specified in the section of the report where 
they could be most useful, are emphasized: 

1. Specify the hypothesis actually tested by an analysis rather 
than only relating the analysis to a general evaluation question. 

2. Describe the variables used in each analysis, the rationale 
and procedure for selecting the measures, and the relationships among 
the measures, variables, and evaluation questions. 

3. Use explicit criteria to decide whether or not to make a 
specific analysis and report the extent to which an analysis meets those 
criteria. 

4. Use explicit criteria to decide whether to include a covariate 
in a specific analysis. 

5. Describe the groups included in the ANCOVA-type analysis 
by reporting: 

(a) adjusted and unadjusted raw or standard means on 
the dependent variable(s) for each group, 

(b) summary statistics for each group on the covariates 
used in each analysis, and 

(c) a detailed description of the educational experiences 
of the program groups and of any comparison groups. 
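
The quantities named in points 5(a) and 5(b) can be illustrated with a 
small computation. What follows is only a minimal sketch, not the 
procedure used in the Follow Through evaluation. It assumes a single 
covariate and a common (pooled within-group) regression slope, and the 
scores, the group labels, and the function name ancova_adjusted_means 
are all hypothetical, invented for this illustration. 

```python
from statistics import mean


def ancova_adjusted_means(groups):
    """Report covariate and outcome summaries for each group.

    groups maps a group label to a list of (covariate, outcome) pairs.
    Assumes one covariate and homogeneous within-group slopes.
    Returns (pooled_slope, report).
    """
    # Grand mean of the covariate over all cases in all groups.
    grand_x = mean(x for pairs in groups.values() for x, _ in pairs)

    sxx = 0.0  # pooled within-group sum of squares of the covariate
    sxy = 0.0  # pooled within-group sum of cross-products
    summaries = {}
    for label, pairs in groups.items():
        xs = [x for x, _ in pairs]
        ys = [y for _, y in pairs]
        mx, my = mean(xs), mean(ys)
        sxx += sum((x - mx) ** 2 for x in xs)
        sxy += sum((x - mx) * (y - my) for x, y in pairs)
        summaries[label] = (len(pairs), mx, my)

    slope = sxy / sxx  # pooled within-group regression slope

    report = {}
    for label, (n, mx, my) in summaries.items():
        report[label] = {
            "n": n,
            "covariate_mean": mx,
            "unadjusted_mean": my,
            # Standard ANCOVA adjustment: shift each group mean to the
            # value predicted at the grand covariate mean.
            "adjusted_mean": my - slope * (mx - grand_x),
        }
    return slope, report


# Hypothetical (pretest, posttest) scores; NOT data from any evaluation.
data = {
    "FT": [(10, 20), (12, 24), (14, 28)],
    "NFT": [(14, 30), (16, 34), (18, 38)],
}

slope, report = ancova_adjusted_means(data)
for label, row in report.items():
    print(label, row)
```

Printing the covariate mean, the unadjusted mean, and the adjusted mean 
side by side for each group lets a reader see how far apart the groups 
were at entry and how much of the reported difference depends on the 
adjustment. 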

These points were made in an effort to reduce the ambiguity that 
often ensues when reporting the results of ANCOVA techniques in 
complex longitudinal evaluations. As indicated by Tukey (1954), 
"Experimental statisticians should be honest and expository about the relation 
of precise assumptions and exactly optimum solutions to real situations" 
(p. 719). These considerations are intended to improve evaluators' 
abilities to communicate their findings accurately to the 
nonstatistically oriented reader. A special effort is needed to indicate the 
limitations of an evaluation as well as its strengths so that a more 
balanced and accurate picture of a program and its effects is 
presented to the decision maker, who may be puzzled and awed by the 
mathematical procedures. 



APPENDIX A 
(Quoted from Stebbins, 1976, A-46) 

Given the need to provide information to decision makers, the 
essential problem becomes the development of an evaluation approach 
that will provide the most valid and comprehensive information possible. 
To this end, Follow Through evaluation planners (USOE, SRI) adopted 
a quasi-experimental design, selecting at each site a comparison group 
as similar to the treatment group as possible. Since this design does 
not suggest a single "appropriate" analysis, we have subjected the data 
to a variety of "approximately appropriate" analytic procedures, so as 
not to be overly confined by the drawbacks of the design. The multiple 
strategies approach anticipated the common and valuable practice of 
performing secondary analyses such as those performed on the Equality 
of Educational Opportunity data. Any single analytic treatment of 
quasi-experimental data is inevitably subject to well-founded methodological 
criticism, especially when the data are being used to assess the impact 
of major educational programs. Subsequent reanalyses using other 
techniques and approaches help to assess the validity of the original 
results. Usually, after several reanalyses have been accomplished 
and a body of literature accumulated, all available information is 
integrated to refine and clarify understanding of the problem (or program). 
Our analytic cross-validation anticipates some of the more obvious 
reanalyses and should provide other researchers with a broader basis 
for designing further thoughtful approaches to the Follow Through data. 



APPENDIX B 
(Quoted from Cooley & Leinhardt, 1975b) 



Although the RFP calls for consideration of the "nonachievement 
factors which contribute to classroom environment," it does not spell 
out what these factors might be. It was suggested that the designer 
review this area and propose what definitions and instrumentation, if 
any, should be included in the Individualized Instruction Study. Our 
approach to this task has been two-pronged: (1) to determine whether 
non-cognitive student outcomes can and should be measured, and (2) 
to determine whether it is possible and desirable to assess the effect 
of programs on the total classroom environment. 

We do not recommend that non-cognitive student outcomes be 
assessed in the study for two reasons. First, although schooling, 
individualized or not, may indeed have an effect on some non-cognitive 
outcomes, the theoretical basis for such a belief is not well developed. 
Without a sound basis, it is futile to attempt to measure non-cognitive 
or social outcomes since it is not clear what to measure or how to 
make causal arguments if effects are found. A second argument against 
the test of social outcomes is that their measurement in the primary 
grades is still in a primitive state. 

Our consideration of non-cognitive or social outcomes began with 
the generation of a list of outcomes that designers of instructional 
programs have claimed will be affected by their programs (e.g., 
self-concept, inquiry skills, autonomy). The next step was to locate 
instruments that purport to measure these specific outcomes. The 
short duration of the study ruled out the possibility of developing such 
instruments from scratch. Existing instruments were located, screened, 
and eliminated from further consideration if they failed to meet any 
one of the following criteria: 

1. The instrument could not be highly correlated with reading 
and mathematics ability. If it were, it would measure little not 
already measured by the achievement test battery. 

2. The instrument had to measure the social variables in 
question, i.e., it had to be valid as measured by standard measures of 
validity. 

3. The instrument had to be reliable as measured by standard 
measures of reliability. 

4. The instrument must have been designed or adapted for use 
in the primary grades. 

5. The instrument must be usable from an administrative 
standpoint. This criterion would rule out instruments that are described 
in the literature but are otherwise untraceable, those that require an 
exorbitant amount of pupil/examiner time (in excess of three hours per 
pupil), and those that require a highly trained examiner or coder. A 
number of projective tests like doll-play were eliminated under this 
criterion. 

The results of the search for an instrument that would meet these 
criteria were disappointing. Not one instrument of the many considered 
was totally acceptable. Table 3.1 lists some of the tests that were 
rejected and a criterion they failed. They may have failed other criteria, 
but this information was not recorded because the test reviewers 
eliminated an instrument upon failure to meet one criterion. 